We are using the same dataset from our first project. The dataset contains 9578 observations and 14 variables and is based on loans from LendingClub, a peer-to-peer lending company, that were made from 2007 to 2015. In summary, our EDA found:
credit.policy and other numeric variables such as
int.rate, fico, and
inq.last.6mths. Additionally, there was no clearly
established relationships between not.fully.paid and other
numeric variables (except for credit.policy).purpose,
credit.policy, and not.fully.paid.delinq.2yrs, their
mean significantly varies for different categories
ofpurpose.We obtained the dataset from Kaggle here: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis
Our work is stored on our team GitHub here: https://github.com/jschild01/JMB_DATS_6101
From our EDA in project 1 we know that the credit.policy
and not.fully.paid variables function as logicals, and the
purpose variable functions as a factor with 7 levels. We
will start by formally converting these variables from numerical
date-types in R to logicals and the factor date-type, respectively.
# We determined this makes sense to do from our EDA
str(loans) # keep??
loans$credit.policy <- as.logical(loans$credit.policy)
loans$not.fully.paid <- as.logical(loans$not.fully.paid)
loans$purpose <- as.factor(loans$purpose)
Of the 14 variables in the dataset, 11 are numeric, 2 are logical, and 1 is a factor.
From our EDA we saw that the overwhelming majority of loans had a
value of 0 for delinq.2yrs and pub.rec. The
variables might be more useful in terms of having stronger correlations
and a larger impact in models if we create logical versions of these
variables. We will introduce has.delinq.2yrs and
has.pub.rec based on whether or not the value is 0 or
greater than 0.
# It might help to turn these variables into logicals since most are 0
loans <- loans %>%
mutate(has.delinq.2yrs = delinq.2yrs > 0,
has.pub.rec = pub.rec > 0)
From our EDA we saw that the revol.bal variable had a
wide range of values as well as outlier issues. We wondered if it would
work better if we took the natural log of it, similar to how the income
provided in the dataset the dataset (the log.annual.inc
variable) is in the form of the natural log. However, some values for
revol.bal are 0, and taking the natural log of that results
in -Inf. One way to deal with this without removing those values is to
add 1 to all of the values for revol.bal before taking the
natural log.
Given the range of these values (the IQr is 1.50625^{4}) adding one
should have a negligible effect, and adding one to all values keeps the
data consistent. In case it is significant that the
revol.bal is 0 we will add a logical variable
has.revol.bal based on whether the value of
revol.bal is 0 or greater than 0.
# Some loans have a revol.bal value of 0, the natural log of which is -Inf
# Adding 1 to all values before taking the natural log resolves this issue by nudging the value a very small amount
# In case it was meaningful if the loan had a revol.bal value of 0 versus >0, we can add a logical to account for that
loans <- loans %>%
mutate(log.revol.bal = log(revol.bal+1),
has.revol.bal = revol.bal > 0)
Now that we have added some new variables, it would be helpful to look at a correlation plot and make some comparisons to assess whether they would be more useful going forward than their original forms.
# A new correlation matrix with the new and transformed variables
# For our correlation matrix we want to include everything but the purpose variable
# We can put the new variables together at the top
loans_correlation_matrix_new <- loans %>%
select(-purpose) %>%
select(revol.bal, log.revol.bal, has.revol.bal, has.delinq.2yrs, has.pub.rec, everything()) %>%
cor()
# The mixed correlation plot makes a nice visualization
corrplot.mixed(loans_correlation_matrix_new, tl.pos = "lt")
Some notable observations of this: 1. has.delinq.2yrs
doesn’t have any strong correlations, and the 2 potentially useful ones
(int.rate and fico) are almost exactly the
same as delinq.2yrs. 2. Similarly, has.pub.rec
doesn’t have any strong correlations, and the 2 potentially useful ones
(int.rate and fico) are the same as
pub.rec. 3. log.revol.bal has a stronger
correlation with dti and a much stronger correlation with
revol.util compared to revol.bal. 4.
log.revol.bal has a weaker correlation with
credit.policy compared to revol.bal. 5.
has.revol.bal does not have a significant correlation with
anything (log.revol.bal excepted) except
revol.util, where the correlation is significantly weaker
than that of log.revol.bal.
From this we can conclude that converting delinq.2yrs
and pub.rec to logicals did not add any advantage for our
dataset. We can also conclude that the has.revol.bal
variable is not necessary and will not add value to our analysis beyond
what log.revol.bal already covers. Successfully eliminating
these as possibilities increases our confidence in the use of their
original structures.
Lastly, we see that taking the natural log of revol.bal
might be useful, as log.revol.bal has some stronger
correlations than revol.bal. We will note this as we
proceed with our modeling.
These are updated histograms, boxplots, and Q-Q plots with the log.revol.bal added variable. We will use these to briefly review the EDA we did for the first project.
Since one of our variables of intrest, credit.policy is
a logical, we can treat is as a 2-level factor and build a
classification tree model using the variables it has significant
correlation with, in this case int.rate, fico,
and inq.last.6mths.
For this variable we are interested in the loans where the value is
FALSE. That is, loans that do not meet the credit
underwriting criteria, and therefore are more at risk of defaulting.
These loans are in the minority. We will rename the FALSE
values to Fail, indicating that the loan fails to meet the
underwriting criteria, and rename the TRUE values to
Meets, indicating that the loan meets the underwriting
criteria.
Building the classification tree, we can see that
int.rate was not used in the model, but fico,
and inq.last.6mths were.
# Since we are renaming some values let's make a separate dataset for this model
loans_tree <- loans
# First convert the credit.policy variable from a logical to a factor, then rename the levels
loans_tree$credit.policy <- as.factor(loans_tree$credit.policy) %>%
fct_recode(Meets = "TRUE", Fails = "FALSE")
# Max depth of 4 based on Professor Faruqe's instructions
creditfit <- rpart(credit.policy ~ int.rate + fico + inq.last.6mths, data=loans_tree, method="class", control = list(maxdepth = 4) )
printcp(creditfit) # display the results
##
## Classification tree:
## rpart(formula = credit.policy ~ int.rate + fico + inq.last.6mths,
## data = loans_tree, method = "class", control = list(maxdepth = 4))
##
## Variables actually used in tree construction:
## [1] fico inq.last.6mths
##
## Root node error: 1868/9578 = 0.19503
##
## n= 9578
##
## CP nsplit rel error xerror xstd
## 1 0.484475 0 1.00000 1.00000 0.020759
## 2 0.187366 1 0.51552 0.51552 0.015755
## 3 0.076017 2 0.32816 0.32816 0.012823
## 4 0.010000 3 0.25214 0.25214 0.011329
plotcp(creditfit) # visualize cross-validation results
# Show these results??
summary(creditfit) # detailed summary of splits
To add: comments on results
We can better visualize the tree by plotting it out. We can see the
inq.last.6mths threshold of greater than or equal to four,
as well as the two thresholds of 740 and 660 or FICO
scores. It details that of our 9578 loan sample, there are 1868 loans
that failed to meet the credit.policy and 7710 that
did.
# plot the tree and add text
rpart.plot(creditfit, uniform=TRUE, main="Classification Tree for Credit.Policy", digits = 3, extra = 1)
#plot(creditfit, uniform=TRUE, main="Classification Tree for credit.policy")
#text(creditfit, use.n=TRUE, all=TRUE, cex=.75)
Of all of the loans, 1231 had more greater than or equal to four
inq.last.6mths. Of those 1231:
FICO lower than 740 and failed to meet the
credit.policyFICO score 740 or higher and failed to meet
the credit.policyFICO score 740 or higher and met the
credit.policyThis indicates that applicants will likely need a FICO
score of 740 or higher if they have greater than or equal to four
inq.last.6mths in order to anticipate meeting the
credit.policy.
Conversely, of all of the loans, 8347 had less than four
inq.last.6mths. Of those 8347:
FICO lower than 660 and failed to meet the
credit.policyFICO lower than 660 but still managed to meet
the credit.policyFICO score 660 or higher and failed to meet
the credit.policyFICO score 660 or higher and met the
credit.policyThis indicates that applicants will likely need a FICO
score of only 660 or higher if they have less than four
inq.last.6mths in order to anticipate meeting the
credit.policy.
Overall, we can see that less inq.last.6mths aligns with
lower FICO score requirements for the
credit.policy. Generally, if applicants have equal to or
more than four inq.last.6mths, then they will likely need a
FICO score of 740 or higher to meet the
credit.policy. On the other hand, if applicants have less
than four inq.last.6mths, then they will likely only need a
FICO score of 660 or higher to meet the
credit.policy. If applicants have a FICO score
lower than 660, then they are unlikely to meet the
credit.policy, regardless.
# create a postscript plot of tree
# The rpart.plot() function looks quite nice, so we may not need this
post(creditfit, file = "credittree2.ps", title = "Classification Tree for credit.policy")
To better understand how our model performs we can construct a
confusion matrix of the results. The Fails values will
function as the positive cases for the confusion matrix we put together
based on the results of the classification tree model.
# Creating the confusion matrix
confusion_matrix = confusionMatrix(predict(creditfit, type = "class"), reference = loans_tree$credit.policy)
print('Overall: ')
## [1] "Overall: "
confusion_matrix$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 9.508248e-01 8.269988e-01 9.463031e-01 9.550698e-01 8.049697e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 2.836114e-102
print('Class: ')
## [1] "Class: "
confusion_matrix$byClass
## Sensitivity Specificity Pos Pred Value
## 0.7489293 0.9997406 0.9985724
## Neg Pred Value Precision Recall
## 0.9426440 0.9985724 0.7489293
## F1 Prevalence Detection Rate
## 0.8559192 0.1950303 0.1460639
## Detection Prevalence Balanced Accuracy
## 0.1462727 0.8743350
#The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves.According to their scheme a value < 0 is indicating no agreement , 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement.
Based on the Kappa statistic of 0.83 we would evaluate this classifier as almost perfect agreement.
To add: other comments on the overall stats of the model.
From the Trees.rmd file: The overall accuracy is 95.08%. These are the same metrics of sensitivity (also known as recall rate, TP / (TP+FN) ), specificity (TN / (TN+FP) ), F1 score, and others that we used in Logistic Regression and KNN analyses. Indeed, any “classifiers” can use the confusion matrix approach as one of the evaluation tools.
Let’s put together tables to visualize the confusion matrix as well some statistics on it.
# Creating the confusion matrix table
# To get a table that follows the same format as our notes, with the rows representing true values and columns representing predicted values, we will have to adjust the data a bit.
# Start by converting to a data.frame
cm_table <- as.data.frame(confusion_matrix$table) %>%
rename(Actual = "Reference")
# Make it so that the rows represent true values and columns represent predicted values
cm_table$Prediction <- paste0("Prediction - ", cm_table$Prediction)
cm_table <- spread(cm_table, Prediction, Freq)
# Output a nice table
# xkabledply(cm_table, title = "Confusion matrix for the tree model")
cm_table %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| Actual | Prediction - Fails | Prediction - Meets |
|---|---|---|
| Fails | 1399 | 469 |
| Meets | 2 | 7708 |
## Can we bold the values in the first column ("Actual") using kable??
#Creating a table for the confusion matrix statistics
# Start by converting to a data.frame
cm_stats <- as.data.frame(confusion_matrix$byClass) %>%
rownames_to_column()
# Adjust the column names and values to look better
colnames(cm_stats) <- c("Statistic", "Percentage")
cm_stats$Percentage <- paste0(round(cm_stats$Percentage*100, digits=1), "%")
# Output a nice table
# xkabledply(cm_stats, title = "Confusion matrix statistics for the tree model")
cm_stats %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| Statistic | Percentage |
|---|---|
| Sensitivity | 74.9% |
| Specificity | 100% |
| Pos Pred Value | 99.9% |
| Neg Pred Value | 94.3% |
| Precision | 99.9% |
| Recall | 74.9% |
| F1 | 85.6% |
| Prevalence | 19.5% |
| Detection Rate | 14.6% |
| Detection Prevalence | 14.6% |
| Balanced Accuracy | 87.4% |
For certain statistics this model performs extremely well, with near 100% specificity, precision, and positive predictive value. Negative predictive value is also very high at around 94%.
To add: further comments on the statistics of the model.
Can we do this for this type of model?
library(pROC)
# loans_tree$prediction <- predict(creditfit, type = "vector")
loans_tree$prediction <- predict(creditfit, type = "prob")[,2]
tree_roc <- roc(credit.policy ~ prediction, data=loans_tree)
auc(tree_roc) # area-under-curve prefer 0.8 or higher.
plot(tree_roc)
The AUC of the ROC is 0.8773732, which is greater than the 0.8 threshold, so we would consider this to be a good indicator/model.
# fit linear model
linear_model <- lm(int.rate ~ fico, data=loans)
# view summary of linear model
summary(linear_model)
# scatterplot with regression line and confidence interval
# scatterplot(int.rate ~ fico, data = loans)
loans %>%
ggplot(aes(x = fico, y = int.rate)) +
geom_point(color = "steelblue", alpha = 0.2) +
geom_smooth(method = "lm", se = T) +
labs(title = "Interest Rate vs FICO Score",
x = "FICO Score", y = "Interest Rate") +
scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
scale_y_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
theme_minimal()
# fit linear model
linear_model <- lm(revol.util ~ int.rate, data=loans)
# view summary of linear model
summary(linear_model)
# scatterplot with regression line and confidence interval
# scatterplot(revol.util ~ int.rate, data = loans)
loans %>%
ggplot(aes(x = int.rate, y = revol.util)) +
geom_point(color = "steelblue", alpha = 0.2) +
geom_smooth(method = "lm") +
labs(title = "Revolving Line Utilization Rate vs Interest Rate",
x = "Interest Rate", y = "Revolving Line Utilization Rate") +
scale_x_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
scale_y_continuous(labels = label_percent(scale = 1)) +
theme_minimal()
# fit linear model
linear_model <- lm(installment ~ log.annual.inc, data=loans)
# view summary of linear model
summary(linear_model)
# scatterplot with regression line and confidence interval
# scatterplot(installment ~ log.annual.inc, data = loans)
loans %>%
ggplot(aes(x = log.annual.inc, y = installment)) +
geom_point(color = "steelblue", alpha = 0.2) +
geom_smooth(method = "lm") +
labs(title = "Installment vs Log of Annual Income",
x = "Log of Annual Income", y = "Installment") +
theme_minimal()
# fit linear model
linear_model <- lm(revol.util ~ fico, data=loans)
# view summary of linear model
summary(linear_model)
# scatterplot with regression line and confidence interval
# scatterplot(revol.util ~ fico, data = loans)
loans %>%
ggplot(aes(x = fico, y = revol.util)) +
geom_point(color = "steelblue", alpha = 0.2) +
geom_smooth(method = "lm") +
labs(title = "Revolving Line Utilization Rate vs FICO Score",
x = "FICO Score", y = "Revolving Line Utilization Rate") +
scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
scale_y_continuous(labels = label_percent(scale = 1)) +
theme_minimal()
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$not.fully.paid)
model <- train(int.rate~., data=loans, method="lm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
#names(getModelInfo())
model <- lm(int.rate ~ fico+installment+purpose+dti+inq.last.6mths+credit.policy+log.annual.inc, data = loans)
summary(model)
model$coefficients
model$call
coef(model)
## (Intercept) fico installment
## 0.4819228196 -0.0005143307 0.0000428540
## purposecredit_card purposedebt_consolidation purposeeducational
## -0.0033260863 -0.0012546309 0.0001820383
## purposehome_improvement purposemajor_purchase purposesmall_business
## 0.0019883173 0.0014739586 0.0157732687
## dti inq.last.6mths credit.policyTRUE
## 0.0001682692 0.0005470733 -0.0020864159
## log.annual.inc
## -0.0008158434
confint(model)
## 2.5 % 97.5 %
## (Intercept) 4.731186e-01 4.907271e-01
## fico -5.236009e-04 -5.050605e-04
## installment 4.107866e-05 4.462934e-05
## purposecredit_card -4.423902e-03 -2.228270e-03
## purposedebt_consolidation -2.099536e-03 -4.097257e-04
## purposeeducational -1.609548e-03 1.973625e-03
## purposehome_improvement 5.855629e-04 3.391072e-03
## purposemajor_purchase -1.358131e-04 3.083730e-03
## purposesmall_business 1.434543e-02 1.720111e-02
## dti 1.199051e-04 2.166333e-04
## inq.last.6mths 3.765454e-04 7.176012e-04
## credit.policyTRUE -3.076261e-03 -1.096571e-03
## log.annual.inc -1.402356e-03 -2.293305e-04
library("broom")
tidyfinal <- tidy(model)
tidyfinal
Model_Summary <- augment(model)
str(Model_Summary)
head(Model_Summary)
par(mfrow=c(2,2))
plot(model)
vif(model)
lmtest::bptest(model)
car::ncvTest(model)
distBCMod <- caret::BoxCoxTrans(loans$int.rate)
print(distBCMod)
loans <- cbind(loans, dist_new=predict(distBCMod, loans$int.rate))
head(loans)
mod1 <- lm(dist_new ~ fico+installment+purpose+revol.util+log.revol.bal+credit.policy+log.annual.inc, data=loans)
summary(mod1)
lmtest::bptest(mod1)
par(mfrow=c(2,2))
plot(mod1)
To study the effects on admission by the factor purpose
(credit.policy and purpose are both categorical variables), wee can
create two-way contingency table of the outcome and predictors, and make
sure there are no cells of zero frequencies.
*A contingency table, sometimes called a two-way frequency table, is a
tabular mechanism with at least two rows and two columns used in
statistics to present categorical data in terms of frequency counts.
credit.policypurposetable = xtabs(~ credit.policy + purpose, data = loans)
credit.policypurposetable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
# prepare training scheme
loans$credit.policy = as.factor(loans$credit.policy)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$credit.policy)
model <- train(credit.policy~., data=loans, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
#names(getModelInfo())
<<<<<<< Updated upstream
loans$credit.policy <- factor(loans$credit.policy)
str(loans)
loans$purpose <- factor(loans$purpose)
#str(credit.policy)
credit.policyLogit <- glm(credit.policy ~ inq.last.6mths+fico+log.annual.inc+dti+delinq.2yrs+pub.rec+purpose , data = loans, binomial(link = "logit") )
We can see the summary of the logit model here:
summary(credit.policyLogit)
xkabledply(credit.policyLogit, title = "Logistic Regression :")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -23.2914 | 1.0717 | -21.7330 | 0.0000 |
| inq.last.6mths | -0.8452 | 0.0223 | -37.8305 | 0.0000 |
| fico | 0.0357 | 0.0013 | 26.7228 | 0.0000 |
| log.annual.inc | 0.1406 | 0.0578 | 2.4337 | 0.0149 |
| dti | -0.0205 | 0.0053 | -3.8755 | 0.0001 |
| delinq.2yrs | -0.0255 | 0.0555 | -0.4602 | 0.6454 |
| pub.rec | 0.2791 | 0.1265 | 2.2056 | 0.0274 |
| purposecredit_card | 0.0268 | 0.1188 | 0.2255 | 0.8216 |
| purposedebt_consolidation | 0.4081 | 0.0894 | 4.5636 | 0.0000 |
| purposeeducational | -0.0394 | 0.1848 | -0.2131 | 0.8312 |
| purposehome_improvement | 0.2858 | 0.1609 | 1.7764 | 0.0757 |
| purposemajor_purchase | 0.3689 | 0.1927 | 1.9141 | 0.0556 |
| purposesmall_business | 0.1689 | 0.1557 | 1.0847 | 0.2780 |
Before moving on, let us look at the model object
credit.policyLogit a little deeper. The fitted values can
be found from credit.policyLogit$fitted.values. And the
first fitted value is 0.9904137. This is the probability of being
credit.policyted for data point #1. Compare to the value from
predict(credit.policyLogit) to be 4.6377906.
The predict() function gives us the logit value. You can
exponentiate to get the odds ratio p/q as 103.315827. And finally, we
can find p from p/q, and indeed it is confirmed to be 0.9904137.
The easier way to get that is simply use
predict(credit.policyLogit, type = c("response"))[1] =
0.9904137. The predict() function will also allow you to
find model prediction with unseen/untrained data points where
fitted.values do not give.
p_fitted = credit.policyLogit$fitted.values[1] # this is the model predicated value p-hat for the first data row (not the actual data point p)
This is stored in the model as the fitted value for the probability
p of the first data point. Since the actual data point is a
0/1 True/False value, it is not easy to directly compare them unless we
use a cutoff value (default as 0.5) to convert the probability
p to 0/1.
Now, for unseen data point, we can use the predict( )
function to find the model predicted values. But be careful of what you
are getting with the predict() function in classification
models. Let us compare the three options below. For easy comparison, let
us use the same data point in the dataset as an example.
# This gives you the predicted values of the data points inside the model.
predict(credit.policyLogit) # the is from the model, which gives you the value for logit(p) or ln(p/q)
We can easily determine the confidence intervals of each coefficient with these two slightly different ways:
<<<<<<< Updated upstream
## CIs using profiled log-likelihood
# confint(credit.policyLogit)
xkabledply( confint(credit.policyLogit), title = "CIs using profiled log-likelihood" )
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -25.4097 | -21.2079 |
| inq.last.6mths | -0.8895 | -0.8020 |
| fico | 0.0331 | 0.0384 |
| log.annual.inc | 0.0275 | 0.2539 |
| dti | -0.0309 | -0.0101 |
| delinq.2yrs | -0.1330 | 0.0847 |
| pub.rec | 0.0355 | 0.5317 |
| purposecredit_card | -0.2052 | 0.2606 |
| purposedebt_consolidation | 0.2326 | 0.5832 |
| purposeeducational | -0.3970 | 0.3282 |
| purposehome_improvement | -0.0260 | 0.6051 |
| purposemajor_purchase | -0.0020 | 0.7543 |
| purposesmall_business | -0.1333 | 0.4773 |
## CIs using standard errors
# confint.default(credit.policyLogit)
xkabledply( confint.default(credit.policyLogit), title = "CIs using standard errors" )
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -25.3919 | -21.1909 |
| inq.last.6mths | -0.8890 | -0.8014 |
| fico | 0.0331 | 0.0383 |
| log.annual.inc | 0.0274 | 0.2538 |
| dti | -0.0309 | -0.0101 |
| delinq.2yrs | -0.1343 | 0.0832 |
| pub.rec | 0.0311 | 0.5271 |
| purposecredit_card | -0.2060 | 0.2596 |
| purposedebt_consolidation | 0.2328 | 0.5834 |
| purposeeducational | -0.4017 | 0.3229 |
| purposehome_improvement | -0.0295 | 0.6012 |
| purposemajor_purchase | -0.0088 | 0.7467 |
| purposesmall_business | -0.1363 | 0.4741 |
This is just one of the many libraries you can find the confusion matrix. It is easy to use, but not very powerful, lacking ability to choose cutoff value, and it does not give you all the metrics like accuracy, precision, recall, sensitivity, f1 score etc. Nonetheless, it’s handy.
loadPkg("regclass")
# confusion_matrix(credit.policyLogit)
xkabledply( confusion_matrix(credit.policyLogit), title = "Confusion matrix from Logit Model" )
| Predicted FALSE | Predicted TRUE | Total | |
|---|---|---|---|
| Actual FALSE | 1099 | 769 | 1868 |
| Actual TRUE | 221 | 7489 | 7710 |
| Total | 1320 | 8258 | 9578 |
unloadPkg("regclass")
Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC) measures the true positive rate (or sensitivity) against the false positive rate (or specificity). The area-under-curve is always between 0.5 and 1. Values higher than 0.8 is considered good model fit.
loadPkg("pROC") # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
prob=predict(credit.policyLogit, type = "response" )
loans$prob <- NA
loans$prob=prob
h <- roc(credit.policy~prob, data=loans)
auc(h) # area-under-curve prefer 0.8 or higher.
plot(h)
# unloadPkg("pROC")
head(loans)
We have here the area-under-curve of 0.8873879, which is more than 0.8. This test also agrees with the Hosmer and Lemeshow test that the model is considered a good fit.
McFadden is another evaluation tool we can use on logit regressions. This is part of what is called pseudo-R-squared values for evaluation tests. We can calculate the value directly from its definition if we so choose to.
credit.policyNullLogit <- glm(credit.policy ~ 1, data = loans, family = "binomial")
mcFadden = 1 - logLik(credit.policyLogit)/logLik(credit.policyNullLogit)
mcFadden
Or we can use other libraries. The pscl (Political
Science Computational Lab) library has the function pR2()
(pseudo-\(R^2\)) will do the trick.
loadPkg("pscl") # use pR2( ) function to calculate McFadden statistics for model eval
credit.policyLogitpr2 = pR2(credit.policyLogit)
credit.policyLogitpr2
unloadPkg("pscl")
With the McFadden value of 0.41736, which is analogous to the coefficient of determination \(R^2\), only about 51% of the variations in y is explained by the explanatory variables in the model.
A major weakness of the overall model is likely from the small dataset sample size of 9578. We expect a much higher number of observations will increase the sensitivity of the model.
The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.
loadPkg("ResourceSelection") # function hoslem.test( ) for logit model evaluation
credit.policyLogitHoslem = hoslem.test(loans$credit.policy, fitted(credit.policyLogit)) # Hosmer and Lemeshow test, a chi-squared test
The result is shown here:
credit.policyLogitHoslem
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: loans$credit.policy, fitted(credit.policyLogit)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.
The p-value of 0 is relatively low. This indicates the model is y a good fit, despite all the coefficients are significant.
To study the effects on admission by the factor purpose
(not.fully.paid and purpose are both categorical variables), wee can
create two-way contingency table of the outcome and predictors, and make
sure there are no cells of zero frequencies.
*A contingency table, sometimes called a two-way frequency table, is a
tabular mechanism with at least two rows and two columns used in
statistics to present categorical data in terms of frequency counts.
<<<<<<< Updated upstream
not.fully.paidpurposetable = xtabs(~ not.fully.paid + purpose, data = loans)
not.fully.paidpurposetable
set.seed(7)
# load the library
library(mlbench)
library(caret)
loans$not.fully.paid = as.factor(loans$not.fully.paid)
# load the dataset
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$not.fully.paid)
model <- train(not.fully.paid~., data=loans, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
#names(getModelInfo())
loans$not.fully.paid <- factor(loans$not.fully.paid)
str(loans)
loans$purpose <- factor(loans$purpose)
#str(not.fully.paid)
not.fully.paidLogit <- glm(not.fully.paid ~., data = loans, binomial(link = "logit") )
We can see the summary of the logit model here:
summary(not.fully.paidLogit)
xkabledply(not.fully.paidLogit, title = "Logistic Regression :")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 73.6933 | 20.3074 | 3.6289 | 0.0003 |
| credit.policyTRUE | -0.2791 | 0.0965 | -2.8909 | 0.0038 |
| purposecredit_card | -0.5149 | 0.1100 | -4.6808 | 0.0000 |
| purposedebt_consolidation | -0.3238 | 0.0789 | -4.1045 | 0.0000 |
| purposeeducational | 0.0555 | 0.1525 | 0.3643 | 0.7157 |
| purposehome_improvement | 0.1002 | 0.1269 | 0.7896 | 0.4297 |
| purposemajor_purchase | -0.3200 | 0.1669 | -1.9174 | 0.0552 |
| purposesmall_business | 0.5667 | 0.1174 | 4.8271 | 0.0000 |
| int.rate | -90.5900 | 28.3969 | -3.1901 | 0.0014 |
| installment | 0.0012 | 0.0002 | 6.5208 | 0.0000 |
| log.annual.inc | -0.3876 | 0.0610 | -6.3544 | 0.0000 |
| dti | -0.0005 | 0.0047 | -0.1126 | 0.9104 |
| fico | -0.0067 | 0.0016 | -4.2652 | 0.0000 |
| days.with.cr.line | 0.0000 | 0.0000 | 0.8006 | 0.4234 |
| revol.bal | 0.0000 | 0.0000 | 3.1662 | 0.0015 |
| revol.util | 0.0031 | 0.0014 | 2.1652 | 0.0304 |
| inq.last.6mths | 0.0567 | 0.0216 | 2.6242 | 0.0087 |
| delinq.2yrs | -0.1113 | 0.0976 | -1.1404 | 0.2541 |
| pub.rec | -1.4521 | 0.7184 | -2.0212 | 0.0433 |
| has.delinq.2yrsTRUE | 0.0511 | 0.1600 | 0.3194 | 0.7494 |
| has.pub.recTRUE | 1.9029 | 0.7399 | 2.5717 | 0.0101 |
| log.revol.bal | -0.0153 | 0.0310 | -0.4945 | 0.6209 |
| has.revol.balTRUE | -0.1426 | 0.2792 | -0.5108 | 0.6095 |
| dist_new | 50.0380 | 15.5003 | 3.2282 | 0.0012 |
| prob | -0.4150 | 0.2432 | -1.7067 | 0.0879 |
Before moving on, let us look at the model object
not.fully.paidLogit a little deeper. The fitted values can
be found from not.fully.paidLogit$fitted.values. And the
first fitted value is 0.1374278. This is the probability of being
not.fully.paid for data point #1. Compare to the value from
predict(not.fully.paidLogit) to be -1.83682.
The predict() function gives us the logit value. You can
exponentiate to get the odds ratio p/q as 0.1593233. And finally, we can
find p from p/q, and indeed it is confirmed to be 0.1374278.
The easier way to get that is simply use
predict(not.fully.paidLogit, type = c("response"))[1] =
0.1374278. The predict() function will also allow you to
find model prediction with unseen/untrained data points where
fitted.values do not give.
<<<<<<< Updated upstream
p_fitted = not.fully.paidLogit$fitted.values[1] # this is the model predicated value p-hat for the first data row (not the actual data point p)
p_fitted
This is stored in the model as the fitted value for the probability
p of the first data point. Since the actual data point is a
0/1 True/False value, it is not easy to directly compare them unless we
use a cutoff value (default as 0.5) to convert the probability
p to 0/1.
Now, for unseen data point, we can use the predict( )
function to find the model predicted values. But be careful of what you
are getting with the predict() function in classification
models. Let us compare the three options below. For easy comparison, let
us use the same data point in the dataset as an example.
# This gives you the predicted values of the data points inside the model.
predict(not.fully.paidLogit) # the is from the model, which gives you the value for logit(p) or ln(p/q)
We can easily determin the confidence intervals of each coefficient with these two slightly different ways:
## CIs using profiled log-likelihood
# confint(not.fully.paidLogit)
xkabledply( confint(not.fully.paidLogit), title = "CIs using profiled log-likelihood" )
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 34.3522 | 113.9860 |
| credit.policyTRUE | -0.4672 | -0.0887 |
| purposecredit_card | -0.7328 | -0.3013 |
| purposedebt_consolidation | -0.4780 | -0.1688 |
| purposeeducational | -0.2491 | 0.3492 |
| purposehome_improvement | -0.1518 | 0.3459 |
| purposemajor_purchase | -0.6572 | -0.0017 |
| purposesmall_business | 0.3354 | 0.7957 |
| int.rate | -146.9097 | -35.5536 |
| installment | 0.0008 | 0.0015 |
| log.annual.inc | -0.5074 | -0.2683 |
| dti | -0.0099 | 0.0088 |
| fico | -0.0098 | -0.0036 |
| days.with.cr.line | 0.0000 | 0.0000 |
| revol.bal | 0.0000 | 0.0000 |
| revol.util | 0.0003 | 0.0059 |
| inq.last.6mths | 0.0148 | 0.0999 |
| delinq.2yrs | -0.3191 | 0.0651 |
| pub.rec | -3.2674 | -0.3318 |
| has.delinq.2yrsTRUE | -0.2544 | 0.3741 |
| has.pub.recTRUE | 0.7225 | 3.7449 |
| log.revol.bal | -0.0756 | 0.0459 |
| has.revol.balTRUE | -0.6917 | 0.4032 |
| dist_new | 19.9987 | 80.7807 |
| prob | -0.8891 | 0.0654 |
## CIs using standard errors
# confint.default(not.fully.paidLogit)
xkabledply( confint.default(not.fully.paidLogit), title = "CIs using standard errors" )
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 33.8915 | 113.4951 |
| credit.policyTRUE | -0.4682 | -0.0899 |
| purposecredit_card | -0.7305 | -0.2993 |
| purposedebt_consolidation | -0.4783 | -0.1692 |
| purposeeducational | -0.2433 | 0.3543 |
| purposehome_improvement | -0.1485 | 0.3488 |
| purposemajor_purchase | -0.6472 | 0.0071 |
| purposesmall_business | 0.3366 | 0.7968 |
| int.rate | -146.2469 | -34.9331 |
| installment | 0.0008 | 0.0015 |
| log.annual.inc | -0.5072 | -0.2681 |
| dti | -0.0098 | 0.0088 |
| fico | -0.0098 | -0.0036 |
| days.with.cr.line | 0.0000 | 0.0000 |
| revol.bal | 0.0000 | 0.0000 |
| revol.util | 0.0003 | 0.0059 |
| inq.last.6mths | 0.0143 | 0.0990 |
| delinq.2yrs | -0.3025 | 0.0800 |
| pub.rec | -2.8602 | -0.0440 |
| has.delinq.2yrsTRUE | -0.2624 | 0.3646 |
| has.pub.recTRUE | 0.4526 | 3.3531 |
| log.revol.bal | -0.0760 | 0.0454 |
| has.revol.balTRUE | -0.6898 | 0.4046 |
| dist_new | 19.6580 | 80.4180 |
| prob | -0.8916 | 0.0616 |
This is just one of the many libraries you can find the confusion matrix. It is easy to use, but not very powerful, lacking ability to choose cutoff value, and it does not give you all the metrics like accuracy, precision, recall, sensitivity, f1 score etc. Nonetheless, it’s handy.
loadPkg("regclass")
# confusion_matrix(not.fully.paidLogit)
xkabledply( confusion_matrix(not.fully.paidLogit), title = "Confusion matrix from Logit Model" )
| Predicted FALSE | Predicted TRUE | Total | |
|---|---|---|---|
| Actual FALSE | 7999 | 46 | 8045 |
| Actual TRUE | 1488 | 45 | 1533 |
| Total | 9487 | 91 | 9578 |
unloadPkg("regclass")
Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC) measures the true positive rate (or sensitivity) against the false positive rate (or specificity). The area-under-curve is always between 0.5 and 1. Values higher than 0.8 is considered good model fit.
loadPkg("pROC") # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
prob=predict(not.fully.paidLogit, type = "response" )
loans$prob <- NA
loans$prob=prob
h <- roc(not.fully.paid~prob, data=loans)
auc(h) # area-under-curve prefer 0.8 or higher.
plot(h)
# unloadPkg("pROC")
head(loans)
We have here the area-under-curve of 0.6862751, which is less than 0.8. This test also agrees with the Hosmer and Lemeshow test that the model is not considered a good fit.
McFadden is another evaluation tool we can use on logit regressions. This is part of what is called pseudo-R-squared values for evaluation tests. We can calculate the value directly from its definition if we so choose to.
not.fully.paidNullLogit <- glm(not.fully.paid ~ 1, data = loans, family = "binomial")
mcFadden = 1 - logLik(not.fully.paidLogit)/logLik(not.fully.paidNullLogit)
mcFadden
Or we can use other libraries. The pscl (Political
Science Computational Lab) library has the function pR2()
(pseudo-\(R^2\)) will do the trick.
loadPkg("pscl") # use pR2( ) function to calculate McFadden statistics for model eval
not.fully.paidLogitpr2 = pR2(not.fully.paidLogit)
not.fully.paidLogitpr2
unloadPkg("pscl")
With the McFadden value of 0.071316, which is analogous to the coefficient of determination \(R^2\), only about 8% of the variations in y is explained by the explanatory variables in the model.
A major weakness of the overall model is likely from the small dataset sample size of 9578. We expect a much higher number of observations will increase the sensitivity of the model.
The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.
loadPkg("ResourceSelection") # function hoslem.test( ) for logit model evaluation
not.fully.paidLogitHoslem = hoslem.test(loans$not.fully.paid, fitted(not.fully.paidLogit)) # Hosmer and Lemeshow test, a chi-squared test
The result is shown here:
not.fully.paidLogitHoslem
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: loans$not.fully.paid, fitted(not.fully.paidLogit)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.
The p-value of 0 is relatively high. This indicates the model is not really a good fit, despite all the coefficients are significant.